In the document below we explore the given Beer and Brewery data. This includes some analysis of summary statistics followed by a look into the relationship between IBU and ABV and finally some additional exploratory data analysis with an additional data set.
First you can see the data we load along with the libraries used in this analysis.
#Loading Libraries
library(stringr)
library(tidyverse)
library(ggplot2)
library(GGally)
library(dplyr)
library(ggthemes)
library(class)
library(caret)
library(usmap)
# Read in data from Local Files
BeersData = read_csv("Beers.csv")
BreweriesData = read_csv("Breweries.csv")
Below you will see the histogram demonstrating how many breweries can be found in each state. You may see a few top states such as Colorado and California, but several other states have a decent presence of breweries.
#How Many Breweries in Each State?
library(tidyverse)
StatesCount = BreweriesData %>% group_by(State) %>% summarise(n = n())
StatesCount %>% ggplot(aes(x = State, y = n)) + geom_bar(stat="identity") + ggtitle("Number of Breweries per State") + ylab("Count of Breweries")
Next in order to view the entire picture we merge the Beer data with the Brewery data and clean the variable names so they are later easily referenced.
The first six and last six entries of the merged data set are also printed below.
#Merge the Data
BeerWithBreweries = merge(BeersData,BreweriesData, by.x = "Brewery_id", by.y = "Brew_ID")
colnames(BeerWithBreweries)[colnames(BeerWithBreweries)=="Name.x"] <- "Beer_Name"
colnames(BeerWithBreweries)[colnames(BeerWithBreweries)=="Name.y"] <- "Brewery_Name"
head(BeerWithBreweries, n = 6)
## Brewery_id Beer_Name Beer_ID ABV IBU
## 1 1 Get Together 2692 0.045 50
## 2 1 Maggie's Leap 2691 0.049 26
## 3 1 Wall's End 2690 0.048 19
## 4 1 Pumpion 2689 0.060 38
## 5 1 Stronghold 2688 0.060 25
## 6 1 Parapet ESB 2687 0.056 47
## Style Ounces Brewery_Name City
## 1 American IPA 16 NorthGate Brewing Minneapolis
## 2 Milk / Sweet Stout 16 NorthGate Brewing Minneapolis
## 3 English Brown Ale 16 NorthGate Brewing Minneapolis
## 4 Pumpkin Ale 16 NorthGate Brewing Minneapolis
## 5 American Porter 16 NorthGate Brewing Minneapolis
## 6 Extra Special / Strong Bitter (ESB) 16 NorthGate Brewing Minneapolis
## State
## 1 MN
## 2 MN
## 3 MN
## 4 MN
## 5 MN
## 6 MN
tail(BeerWithBreweries, n = 6)
## Brewery_id Beer_Name Beer_ID ABV IBU
## 2405 556 Pilsner Ukiah 98 0.055 NA
## 2406 557 Heinnieweisse Weissebier 52 0.049 NA
## 2407 557 Snapperhead IPA 51 0.068 NA
## 2408 557 Moo Thunder Stout 50 0.049 NA
## 2409 557 Porkslap Pale Ale 49 0.043 NA
## 2410 558 Urban Wilderness Pale Ale 30 0.049 NA
## Style Ounces Brewery_Name
## 2405 German Pilsener 12 Ukiah Brewing Company
## 2406 Hefeweizen 12 Butternuts Beer and Ale
## 2407 American IPA 12 Butternuts Beer and Ale
## 2408 Milk / Sweet Stout 12 Butternuts Beer and Ale
## 2409 American Pale Ale (APA) 12 Butternuts Beer and Ale
## 2410 English Pale Ale 12 Sleeping Lady Brewing Company
## City State
## 2405 Ukiah CA
## 2406 Garrattsville NY
## 2407 Garrattsville NY
## 2408 Garrattsville NY
## 2409 Garrattsville NY
## 2410 Anchorage AK
Next we will address the missing values. First a copy of the data is made so that we can compare the data before and after imputation.
After initial investigations we know ABV seems to have a strong relationship with IBU, we graph this relationship below. Note that we have warnings that are there due to the currently missing data.
#Address the Missing Values in Each Column
#Copy Made to compare differences after Imputation
BeersCopy = BeersData
#First the Exploration Graphs and Residuals of a Linear Model
ggplot(BeersCopy, aes(x = ABV, y = IBU)) +
geom_point () +
geom_smooth (method='lm') +
labs (
title = "ABV and IBU relationship",
subtitle = "Strong Positive Correlation",
x = "ABV",
y = "IBU")
## Warning: Removed 1005 rows containing non-finite values (stat_smooth).
## Warning: Removed 1005 rows containing missing values (geom_point).
Next given this relationship we can test a linear model fit. Below are the residuals and the results of a correlation test. The correlation is somewhat strong, enough so that given the information we have predicting IBU with this model is the best route forward.
#Linear Model
lmibu = lm(BeersData$IBU ~BeersData$ABV, data = BeersData)
#Demonstrate the Residuals
plot(resid(lmibu), main = "Residuals of a Linear Model of IBU based on ABV", ylab = "Residuals")
#Correlation Test to determine how good a linear model fit is.
cor.test(BeersData$ABV,BeersData$IBU, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: BeersData$ABV and BeersData$IBU
## t = 33.863, df = 1403, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6407982 0.6984238
## sample estimates:
## cor
## 0.6706215
Finally we will impute the missing data. First with the linear model we will impute the IBU column where there are ABV values. Going forward we must acknowledge that
After those imputations are complete we will impute the remaining missing IBU values and ABV values with the median of their respective columns. Imputing the median will slightly diminish variance, but should not affect the distribution much. We are accepting that diminished variance in exchange to include all the data.
#Given the above we will move forward with a linear model.
predicted = predict(lmibu, BeersData[,4])
NaTempIndices = which(is.na(BeersData$IBU))
for(i in NaTempIndices){
BeersData$IBU[i] = predicted[i]
}
#For the final 62 Missing Values we can impute the median to not lose the values.
#Determine Medians
MedianIBU <- median(BeersData$IBU, na.rm = TRUE)
MedianABV <- median(BeersData$ABV, na.rm = TRUE)
#Replace IBU
NAIndicesIBU = which(is.na(BeersData$IBU))
for(i in NAIndicesIBU){
BeersData$IBU[i] = MedianIBU
}
#ReplaceABV
NAIndicesABV = which(is.na(BeersData$ABV))
for(i in NAIndicesABV){
BeersData$ABV[i] = MedianABV
}
Finally we can see the change in IBU distribution with the density curves below. This change in distribution is important to note going forward.
#We can now see how the distribution changes here.
BeersCopy %>% ggplot(aes(x = IBU)) + geom_density() + ggtitle("Original IBU Density Curve")
## Warning: Removed 1005 rows containing non-finite values (stat_density).
BeersData %>% ggplot(aes(x = IBU)) + geom_density() + ggtitle("IBU Density Curve with Imputed Values")
#Note that the new data density has a peak above the original data spread and we should remember going forward that the IBU data is artificially inflated
#In order to ensure our combined data is correct we are re-merging it here
BeerWithBreweries = merge(BeersData,BreweriesData, by.x = "Brewery_id", by.y = "Brew_ID")
colnames(BeerWithBreweries)[colnames(BeerWithBreweries)=="Name.x"] <- "Beer_Name"
colnames(BeerWithBreweries)[colnames(BeerWithBreweries)=="Name.y"] <- "Brewery_Name"
Next using our complete data we will view the median alcohol content and international bitterness for each state.
The combined data is summarzied by both the median for IBU and median for ABV and then visualized below. ABV is quite uniform with most states having a similar median ABV. Utah is the exception with a much lower ABV than average. This makes some sense given the large population in Utah of Mormons who do not drink alcohol.
library(ggthemes)
MedianABVState = BeerWithBreweries %>% group_by(State) %>% summarise(MedABV = median(ABV, na.rm = TRUE))
MedianABVState %>% ggplot(aes(x = State, y = MedABV)) + geom_bar(stat = "identity", fill = "light green") +
ylab("Median ABV") + ggtitle("Median Beer ABV by State") + theme_clean()
MedianIBUState = BeerWithBreweries %>% group_by(State) %>% summarise(MedIBU = median(IBU, na.rm = TRUE))
MedianIBUState %>% ggplot(aes(x = State, y = MedIBU)) + geom_bar(stat = "identity", fill = "light blue")+
ylab("Median State IBU") + ggtitle("Median Beer IBU by State - Including Imputed Values") + theme_clean()
Given the large imputation of IBU it is prudent to also visualize the original spread of IBU as well as the full data above. Maine and a few other states have smaller IBUs in the imputed data set which is possibly indicative of an originally small sample size for those states.
#Given we imputed a predicted value into 1005 NAs in IBU here is the original data for comparison
OrigCombined = merge(BeersCopy,BreweriesData, by.x = "Brewery_id", by.y = "Brew_ID")
colnames(OrigCombined)[colnames(OrigCombined)=="Name.x"] <- "Beer_Name"
colnames(OrigCombined)[colnames(OrigCombined)=="Name.y"] <- "Brewery_Name"
OrigMedianIBUState = OrigCombined %>% group_by(State) %>% summarise(MedIBU = median(IBU, na.rm = TRUE))
OrigMedianIBUState %>% ggplot(aes(x = State, y = MedIBU)) + geom_bar(stat = "identity", fill = "light blue")+
ylab("Median State IBU") + ggtitle("Median Beer IBU by State - Original Values") + theme_clean()
## Warning: Removed 1 rows containing missing values (position_stack).
Below we can view the most alcoholic and most bitter beer.
The most alcoholic beer has an ABV of .128, double the median in many states and often the alcohol level of wine, this beer comes from Colorado.
The most bitter beer has an IBU of 138, well above many other types of beer and many consumer’s preference, this beer is produced in Oregon.
#Which state has the maximum ABV beer.
ABVMax = which.max(BeerWithBreweries$ABV)
print(BeerWithBreweries[ABVMax,c(2,4,8,10)])
## Beer_Name ABV
## 375 Lee Hill Series Vol. 5 - Belgian Style Quadrupel Ale 0.128
## Brewery_Name State
## 375 Upslope Brewing Company CO
#Which state has the most bitter IBU beer?
IBUMax = which.max(BeerWithBreweries$IBU)
print(BeerWithBreweries[IBUMax,c(2,5,8,10)])
## Beer_Name IBU Brewery_Name State
## 1857 Bitter Bitch Imperial IPA 138 Astoria Brewing Company OR
As tackled in the earlier question there is an evident relationship between the bitterness of a beer and it’s alcoholic content, this relationship is the basis of imputing nearly half of the IBU values.
As shown below there is a moderate relationship between IBU and ABV. Andd as explored above there is an estimation .67 correlation between the two variables.
Both the data before and after imputation is visualized as the data after imputation has a much stronger relationship.
#Relationship betwen the bitterness of the beer and its alcoholic content, Scaatter plot and discussing
#Imputed
BeersData %>% ggplot(aes(x = ABV, y = IBU)) + geom_point(color = "light blue") + ggtitle("Beer IBU by ABV - With Imputed Values")
#Original Values
BeersCopy %>% ggplot(aes(x = ABV, y = IBU)) + geom_point(color = "light green") +
ggtitle("Beer IBU by ABV - Original Values")
## Warning: Removed 1005 rows containing missing values (geom_point).
Using a KNN(K Nearest Neighbors) model we will assess the relationship of IBU and ABV and how it can predict between Ales and IPAs.
First we view some exploratory plots and can see how the groups cluster somewhat differently.
#IPA data only for exploration
IPABeers = BeersData %>% select(Name, Beer_ID, ABV, IBU, Brewery_id, Style, Ounces) %>% filter(str_detect(Style, "IPA"))
#Ale data only for exploration
AleBeers = BeersData %>% select(Name, Beer_ID, ABV, IBU, Brewery_id, Style, Ounces) %>% dplyr::filter(!str_detect(Style, "IPA") & str_detect(Style,"Ale"))
#Exploratory plots
IPABeers %>% ggplot(aes(x = ABV, y = IBU)) + geom_point(color = " dark green") +
ggtitle("Beer IBU by ABV for IPAs")
AleBeers %>% ggplot(aes(x = ABV, y = IBU)) + geom_point(color = " blue") +
ggtitle("Beer IBU by ABV for Ales")
Now we can begin to test the data with a KNN model. After visualizing the full Ale/IPA data we build a KNN model with an 80/20 test and train split. We set the K to the square root of the number of observations tested and from there we can build the model.
#Data Set of Just IPA and Ale Beers
AleIPABeers = BeersData %>% filter(str_detect(Style,"IPA") | str_detect(Style,"Ale"))
AleIPABeers = mutate(AleIPABeers, Ale_Type = ifelse(str_detect(AleIPABeers$Style,"IPA"), "IPA", "Ale"))
#Convert the Ale_Type column to a factor
AleIPABeers$Ale_Type <- as.factor(AleIPABeers$Ale_Type)
#Before we go into classifcation lets see the data
AleIPABeers %>% ggplot(aes(x = ABV, y = IBU, color = Ale_Type)) + geom_point() +
ggtitle("Beer IBU by ABV for Ales and IPAs")
set.seed(9850)
#Testing a 80 20 split to train
TestPercent = .8
BeerIndices = sample(1:dim(AleIPABeers)[1], round(TestPercent * dim(AleIPABeers)[1]))
#Normalize variable Function to a Z score
normalizeList <- function(x) (x - mean(x)) / (sd(x))
AleIPABeers_Norm = AleIPABeers %>% mutate(IBU_Norm = normalizeList(IBU)) %>% mutate(ABV_Norm = normalizeList(ABV))
#Set train and test sets
Train_AleIPA = AleIPABeers_Norm[BeerIndices,]
Test_AleIPA = AleIPABeers_Norm[-BeerIndices,]
# A rule of thum is to set k = to the sqrt(total number of obs.) and preferably use an odd number to help knn break a potential tie
bestK = round(sqrt(dim(AleIPABeers)[1])) #this returns 39 so no need to manually adjust to an odd number
# Generate a prediction from the classification of all the values in the test dataframe
m1 <- knn(Train_AleIPA[,c(9:10)], Test_AleIPA[,c(9:10)], Train_AleIPA$Ale_Type, k=bestK)
Once we build the model we can test it against our actual testing values. Below is a confusion matrix and a heat map of our table of predictions against actual values. Overall the model seems to work well with a 81% accuracy rate. It has a higher rate of predicting Ales, likely due to the wide variance of Ales compared to IPAs. It is also possible our earlier imputation of IBUs has affected our results somewhat here.
#Use the target test data set in the X axis, and put m1 in the Y axis
ModelKNN = table(Test_AleIPA$Ale_Type,m1)
confusionMatrix(ModelKNN)
## Confusion Matrix and Statistics
##
## m1
## Ale IPA
## Ale 174 35
## IPA 22 76
##
## Accuracy : 0.8143
## 95% CI : (0.7662, 0.8562)
## No Information Rate : 0.6384
## P-Value [Acc > NIR] : 1.156e-11
##
## Kappa : 0.5874
##
## Mcnemar's Test P-Value : 0.112
##
## Sensitivity : 0.8878
## Specificity : 0.6847
## Pos Pred Value : 0.8325
## Neg Pred Value : 0.7755
## Prevalence : 0.6384
## Detection Rate : 0.5668
## Detection Prevalence : 0.6808
## Balanced Accuracy : 0.7862
##
## 'Positive' Class : Ale
##
Actual = factor(c("Ale", "IPA", "Ale", "IPA"))
Predicted = factor(c("Ale", "Ale", "IPA", "IPA"))
Y = c(156,36,34,81)
plotdf = data.frame(Actual,Predicted,Y)
plotdf %>% ggplot(aes(x= Actual, y = Predicted)) + geom_tile(aes(fill = Y), color= "white") +
geom_text(aes(label = Y), color = "white") + theme(legend.position = "none") + scale_fill_continuous(low ="#696ffa", high ="#161a78")+
ggtitle("Predicted as IPA or Ale by Actual IPA or Ale")
In the next steps of analysis we endeavored to compare producing and consuming states.
In addition to the data we began with we will add some data on the consumption of beer by state and per capita. This data comes from the National Institute of Alcohol Abuse and Alcoholism and their report on alcohol consumption in 2018. We manually took a set of data including per capita rates of consumption of beer in thousands of gallons by state. This data is then loaded and cleaned up below.
#Loading the Beer Consumption Data
consumptionData = read.csv("BeerConsumptionData.csv")
#Adding State Abbreviations
stateTranslate = data.frame(Abb = state.abb,Name = state.name)
#Missing DC
DC = data.frame(Abb = "DC", Name = "Dist.of Columbia")
states = rbind(stateTranslate, DC)
states$Name = as.character(states$Name)
states$Abb = as.character(states$Abb)
states = states[order(states$Name),]
#Organizing Consumption Data
consumptionData$StateAbb = states$Abb
consumptionData = consumptionData[order(consumptionData$StateAbb),]
With consumption data organized we can began to explore the consumption of beer by state by both thousands of gallons and by combining the consumption data with mean ABV values we can see some estimates of actual alcohol consumptions by state.
Overall with our consumption only we can see a few northern states that are heavy beer consumers. But by combining mean ABV with this per capita consumption we can compare consumption by actual alcohol consumed per capita per state.
This analysis highlights Nevada as one higher consuming state with more beers produced with a higher alcohol percentage, while it did not seem to be a large beer consumer by per capita beer consumption alone.
#Mean ABV
ABVMean = BeerWithBreweries %>% group_by(State) %>% summarize(Mean = mean(ABV))
ABVMean$state = ABVMean$State
plot_usmap(data = ABVMean, values = "Mean", color = "white") +
scale_fill_continuous(name = "Mean ABV", low = "#bdffe6", high = "#0f4d36") +
theme(legend.position = "right") + ggtitle("Mean Alcoholic Beverage Value by State")
#Consumption by State
usmapConsumption = data.frame(state = consumptionData$StateAbb, full = consumptionData$State, perCap = consumptionData$PerCapita.)
plot_usmap(data = usmapConsumption, values = "perCap", color= "white") +
scale_fill_continuous(name = "Per Capita Consumption", low = "#d7e2fa" , high = "#122654") +
theme(legend.position = "right") + ggtitle("Consumption of Beer in Thousands of Gallons Per Capita by State")
#ConsumptionEthanol
#Estimate because we dont know these beers are an exhaustive list
usmapConsumption$meanABV = ABVMean$Mean
usmapConsumption = usmapConsumption %>% mutate(TotalEthanolConsumption = (meanABV * perCap)*1000)
plot_usmap(data = usmapConsumption, values = "TotalEthanolConsumption", color= "white") +
scale_fill_continuous(name = "Per CapitaConsumption", low = "#d7e2fa" , high = "#122654") +
theme(legend.position = "right") + ggtitle("Average Consumption of Alcohol through Beer in Gallons Per Capita by State")
Finally we can address the gap between states who produce many types of beer and those who consume more. Quantifying this gap gives us a chance to build a list of states that can be targeted as low producers who may be good targets for more locally produced beers owned by Budweiser larger producing states may offer an opportunity for Budweiser to acquire some locally produced beers to build a larger foothold in the more active beer producing states.
Below we have a graph of which states produce the most unique beers, again Colorado and California top the list. Following is the visualization comparing the per capita consumption rate of beer in thousands of gallons by the number of unique beers produced by state. The resulting quantity provides us a measure of producing states and consuming states.
The darker states are producing beers for their consumers and represent an opportunity for acquisition of local beers to enter the local production scene.
The lighter states in comparison are states with few locally produced beers but may have decent consumption rates. Here Budweiser may have a lower competition market opportunity.
#Count of Beers per State
CountofBeers = BeerWithBreweries %>% group_by(State) %>% summarise(Count = n())
CountofBeers$state = CountofBeers$State
plot_usmap(data = CountofBeers, values = "Count", color = "white") +
scale_fill_continuous(name = "Unique Beers Produced", low = "#bdffe6", high = "#0f4d36") +
theme(legend.position = "right") + ggtitle("Number of Unique Beers Produced by State")
#Compare Production with Consumption
ComparedData = data.frame(usmapConsumption, UniqueBeers = CountofBeers$Count)
ComparedData = ComparedData %>% mutate(Producers= UniqueBeers / perCap)
#Producers and Consumer States
#Light green are states with best opportunities for introducing new beers
#Darker states are producers
plot_usmap(data = ComparedData, values = "Producers", color= "white") +
scale_fill_continuous(name = "Unique Beers by Consumption", low = "#d1ffe7" , high = "#2d4036") +
theme(legend.position = "right") + ggtitle("Number of Unique Beers Prodcued by Per Capita Consumption by State")
Overall we show there is an interesting relationship between ABV and IBU that can be used to categorize some styles of beer.
There is business opportunity in identifying states with small amounts of beer produced but high consumption as lower competition states where market share may be gained at a low cost. Alternatively states with a high production are potential states to seek acquisitions in if Budweiser is looking to widen it’s portfolio of smaller local beers.
Comment on the summary statistics and distributions of the ABV variable.
Below we can view the histogram and summary of the ABV variable. The distribution is somewhat right skewed with many of the values around .05, there are some larger values, like our max of .128, that skew the data.